\section{Performance}

It is notoriously hard to test the performance of garbage collection
algorithms.  Nevertheless, we would like to get some idea of the time
the various phases take.  To that end, we created a few test cases
which (together with some educated guesses) will give us some ballpark
figures with respect to performance of our method, at least as
compared to other methods. 

For our tests, we chose a heap size of $2^{19}$ words of $8$ bytes
each.  This heap size was not chosen randomly.  It was chosen so that
the entire heap would fit in the cache memory of the computer used for
the tests (x86-64, 1.6GHz, GNU/Linux, SBCL), and research
\cite{Marlow:2011:MGC:2076022.1993482} indicates that a nursery this
size is a good choice.

Thus, we have the following definitions valid for all the tests:

{\small\begin{verbatim}
(defparameter *size* (ash 1 19))

(defparameter *heap* (make-array *size*))

(defparameter *bitmap*
  (make-array *size* :element-type 'bit))
\end{verbatim}}


\subsection{Phase 1, marking}

We did not attempt to test the marking phase, because it is no
different from the marking phase of any other algorithm.

\subsection{Phase 2, compaction}

To test the compaction phase, we wrote the following compaction
program:

{\small\begin{verbatim}
(defun move (heap bitmap)
  (declare (type (simple-vector #.*size*) heap)
           (type (array bit (#.*size*)) bitmap)
           (optimize (speed 3) (safety 0) (debug 0)))
  (let* ((d (position 0 bitmap))
         (s (position 1 bitmap :start d)))
    (declare (type (integer 0 #.*size*) d s))
    (loop until (= s #.*size*)
          do (setf (aref heap d) (aref heap s))
             (incf d)
             (incf s)
             (loop until (or (= s #.*size*)
                             (= (sbit bitmap s) 1))
                   do (incf s)))))
\end{verbatim}}

The hypothesis here is that the time consumed by the compaction phase
is determined by two elements:

\begin{enumerate}
\item A constant term that has to do with scanning the bitmap.
\item A term proportional to the number of elements that have to be
  moved.
\end{enumerate}

To test the hypothesis, we executed the function \texttt{move} with
three different bitmaps as shown in the following table:

\begin{tabular}{|r|r|}
\hline
Element count & CPU time (ms)\\
\hline
$1$ & $0.8$ \\
$2^{18}$ & $1.1$\\
$2^{19}-1$ & $1.4$\\
\hline
\end{tabular}

To avoid measuring the performance of the function \texttt{position},
the first line in the table was executed with a bitmap where the
element with index $1$ was equal to $1$.  The second line in the table
was executed with a bitmap where every other element was $1$.  The
third line in the table was executed with a bitmap where only element
$0$ was $0$.  In all cases, the test was run in a loop with $1000$
iterations.  The loop was the argument of the \texttt{time} macro, and
the right column of the table shows the result returned by
\texttt{time} divided by $1000$. 

The hypothesis above seems confirmed where scanning the bitmap seems
to take around $0.8$ms and moving an element seems to take around
$1$ns per element.  

Now, in a typical collection cycle, at least around half of the
elements of the heap would be dead (if not, objects would be
\emph{promoted}).  Furthermore, since objects that are allocated
together die together, the bitmap will contain long runs of $0$s and
long runs of $1$s, and because objects either die young or survive a
long time, the bitmap will typically start with a run of $1$s.  A
slightly more clever implementation of the bitmap would then test $64$
bits at a time rather than individual bits like we do in our test.  In
such an implementation, the time for managing the bitmap would be
negligible.  In a typical collection cycle, significantly fewer than
half of the elements would have to be moved.  The expected compaction
time would therefore be closer to $0.3$ms. 

\subsection{Phase 3, building the table}

In order to test the performance of building the table, we used the
following function:

{\small\begin{verbatim}
(defun build-table (heap bitmap start)
  (let ((acc 0)
        (end start))
    (declare (type fixnum acc start end)
             (type (simple-vector #.*size*) heap)
             (type (simple-array bit (#.*size*)) bitmap)
             (optimize (speed 3) (safety 0) (debug 0)))
    (loop for address of-type fixnum from 0 
          for prev of-type bit = 1 then bit
          for bit of-type bit across bitmap
          do (when (= bit 0)
               (when (= prev 1)
                 (setf (svref heap end) address)
                 (setf (svref heap (1+ end)) acc)
                 (incf end 2))
               (incf acc))
          finally (when (= prev 1)
                    (setf (svref heap end) address)
                    (setf (svref heap (1+ end)) acc)
                    (incf end 2))
                  (return end))))
\end{verbatim}}

In that program, the parameter \texttt{start} contains the total
number of live words in the heap, which is the same as the start index
into the heap where the table should be built. 

As for the compaction phase, the hypothesis here is that the time
consumed by the compaction phase is determined by two elements:

\begin{enumerate}
\item A constant term that has to do with scanning the bitmap.
\item A term proportional to the number of entries in the table to be
  built, which is the same as the number of zones of dead elements.
\end{enumerate}

We further hypothesize that the performance of this phase is
independent of the \emph{size} of the zones of dead elements. 

We start by testing the last hypothesis.  For that, we run the
function \texttt{build-table} above with two different bitmaps, both
containing $2^{10}$ zones of dead objects and the same number of zones
of live objects.  In the first test, the size of each live zone is a
single word.  In the second case, the size of each live zone is $2^8$
words.  The result of this test is shown in the following table:

\begin{tabular}{|r|r|}
\hline
Live zone size & CPU time (ms)\\
\hline
$1$ & $1.0$\\
$2^8$ & $1.0$\\
\hline
\end{tabular}

As we can see, the hypothesis that the sizes of the zones do not
influence the performance of this phase is confirmed.  

Next we tested the function \texttt{build-table} for different sizes
of the table to be built.  The size of each live zone was a single
word.  The following table shows the result:

\begin{tabular}{|r|r|}
\hline
Table size & CPU time (ms)\\
\hline
$2^{1}$ & $1.0$\\
$2^{9}$ & $1.0$\\
$2^{11}$ & $1.0$\\
$2^{13}$ & $1.1$\\
$2^{15}$ & $1.4$\\
$2^{17}$ & $1.2$\\
\hline
\end{tabular}

It appears that the time to build the table is entirely dominated by
the time to traverse the bitmap.  We have no explanation as to why the
time does not increase monotonically with the table size, but we have
run this test numerous times and we still get this surprising result.
It appears that this anomaly does not manifest itself on computers
other than ours.

With improved bitmap management code, we believe that building the
table will typically take about the same as compacting the heap, i.e.,
around $0.3$ms. 

\subsection{Phase 3, adjusting pointers}

In the worst case, every live pointer must be updated, and each such
pointer requires $n$ iterations to search the break table, where $n$
is the depth of that table.  The number of live pointers is the
difference of the size of the heap and the number of words required
for the break table.  In the worst case, the number of words required
for the break table is as small as possible (which is $2^{n+1}$), so
that the number of live pointers is $2^N - 2^{n+1}$.  The total amount
of work required can therefore be expressed as $n \cdot (2^N -
2^{n+1})$.  When $N = 19$, this function has a maximum for $n = 15$.

We can get an upper bound on the time to adjust the pointers by
creating a heap with roughly $2^{19} - 2^{16}$ words containing random
values in some interval $[0,K]$ followed by a break table with
$2^{16}$ words, with addresses in the interval $[0,K+1]$.

In order to search the break table, we used the following binary
search function:

{\small\begin{verbatim}
(defun binary-search (heap first last address)
  (declare (type (simple-vector #.*size*) heap)
           (type (integer 0 (#.*size*)) 
                 first last address)
           (optimize (speed 3) (debug 0) (safety 0)))
  (let ((f first)
        (l last))
    (declare (type (integer 0 (#.*size*)) f l))
    (loop until (= (- l f) 2)
          do (let* ((mid (1+ (ash (ash (+ l f) -2) 1)))
                    (elt (svref heap mid)))
               (declare (type fixnum elt))
               (if (< address elt)
                   (setf l mid)
                   (setf f mid))))
    (svref heap (1+ f))))
\end{verbatim}}

The following function was used to adjust the pointers of the heap:

{\small\begin{verbatim}
(defun adjust ()
  (declare (optimize (speed 3) (debug 0) (safety 0)))
  (loop with w = (worst-frontier 19)
        with l = (1- (expt 2 19))
        with g = *g*
        for i from 0 below w
        do (incf (aref *g* i)
                 (binary-search g w l i))))
\end{verbatim}}

On our computer, this function takes approximately $30$ms to run,
again on a heap containing $2^{19}$ words.  While this number may seem
high, we should keep in mind that it is pessimistic in several
different ways:

\begin{itemize}
\item This test assumes that almost $90\%$ of the heap is live.  When
  there are this many live objects, they will be \emph{promoted}, so a
  typical number should be significantly less than $50\%$.
\item It assumes that there are $2^{15}$ dead zones, which would make
  each zone very small.  A more typical situation is that dead zones
  are significantly larger, so that there are significantly fewer dead
  zones. 
\item It assumes that every live word in the heap contains a pointer
  that needs to be adjusted.  More typically, many words contain
  immediate data, or pointers to older generations, and these pointers
  do not need to be adjusted.
\end{itemize}

Furthermore, there is a simple optimization that we think can
eliminate a large fraction of the table searches.  By keeping the
result ($a_i$, $d_i$, and $a_{i+1}$) between iterations and comparing
the pointer to be adjusted to these values, we think that in many
cases we will get a hit, simply because it is likely that two
consecutive pointers belong to the same live zone and refer to objects
in that zone.  

An educated guess, then, is that a typical value for this phase would
be closer to $5$ms. 

\subsection{Overall performance}

Taken together, the performance figures in the previous section
suggest that the entire compaction phase of our suggested
\emph{mark-and-compact} sliding collector would take around $6$ms on a
fairly modest desktop, with an absolute worst-case of around $30$ms.

Including the mark phase, there are two possibilities:

\begin{itemize}
\item Compaction dominates over marking.  Then the figures given above
  are reasonable estimates of the total time for the entire
  collection.
\item Marking dominates over compaction.  Then the figures given above
  are sufficiently good that our method is no more expensive than
  others such as mark-and-sweep and copying. 
\end{itemize}

In addition, we should not forget the reasons for choosing a
sliding collector in the first place:

\begin{itemize}
\item Compared to a copying collector, only half the space is
  required, or, alternatively, the space available in the nursery is
  twice as big.  When this is the case, the number of required
  collections decreases, a situation which gives more time for
  short-lived objects to die, thereby decreasing the total amount of
  work spent in garbage collection.
\item Also compared to a copying collector, the relative ages of
  objects are kept intact, which decreases the risk of promoting
  objects that will die soon after a collection. 
\item Compaction eliminates the potential problem of fragmentation. 
\end{itemize}

All these factor contribute to decreasing the total cost of garbage
collection, although measuring the exact impact is of course very
hard. 

It is not advisable to use our algorithm on the global heap, say a
heap that is three orders of magnitude larger than the one used in
our tests.  Even multiplying our timing results by a thousand would
imply a significant pause, but since the break table requires a
binary search, the time to search it would be proportional to the
logarithm of the size of the heap, penalizing performance even more.
To manage a global heap, we would recommend the technique by 
Kermany et al \cite{Kermany:2006:CCI:1133981.1134023}, which has a
greater overhead, but has the advantage of being incremental,
concurrent, and parallel, thus effectively eliminating significant
pauses. 

One might want to speculate about the performance of our algorithm
when the size of the heap is significantly larger than the size of the
cache, but much smaller than the global heap.  Such a situation might
be desirable if it turns out that the cost of cache misses is
compensated by the fact that more time between collections will allow
more objects to die.  For that reason, we re-ran our tests with a heap
that is $4$ times the size of the heap used for the previous tests.  We
observed a factor $10$ slowdown compared to the tests reported in this
section.  In order for this configuration to be advantageous, the
number of objects that die between two garbage-collection cycles has
to increase substantially, and whether this is the case depends on the
type of application being executed.  Only benchmarks in a final system
or simulations with real allocation traces can determine whether this
is the case.  Note, however, that even though overall throughput
\emph{might} improve with a larger heap, the time for an invocation of
the garbage collector becomes closer to what is tolerable for many
so-called \emph{soft real time} applications such as generation of
sound or images in real time. 

